The AidData Lab at the College of William & Mary has led the effort to shed light on Chinese foreign investment, publishing several versions of AidData’s Global Chinese Official Finance Dataset (2000-2014), linked below. In this study, we examine the behavior of official Chinese foreign investment using data wrangling, text mining, and machine learning techniques.
https://www.aiddata.org/data/chinese-global-official-finance-dataset
# Clear memory and run checks ---------------------------------------------
rm(list = ls())
# Load packages -----------------------------------------------------------
library(readxl)
library(dplyr)
library(ggplot2)
library(caret)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(robustbase)
library(gbm)
library(earth)
library(randomForest)
# Load and clean data -----------------------------------------------------
flow <- readxl::read_excel("GlobalChineseOfficialFinanceDataset_v1.0/GlobalChineseOfficialFinanceDataset_v1.0.xlsx")
flow <- flow %>%
# We will only include projects recommended for research
filter(recommended_for_research == TRUE) %>%
# We will only include projects not part of others, which means umbrella == 0
filter(umbrella == 0) %>%
# We discard indistinguishable flow type projects
filter(flow_class != "Vague (Official Finance)") %>%
# We select only variables with effective numbers of values
select(year, funding_agency, implementing_agency,
recipient_condensed, recipient_region,
title, description, status, flow_class,
usd_defl_2014, usd_current, crs_sector_name,
start_actual, start_planned,
end_actual, end_planned,
is_cofinanced)
After cleaning, the dataset contains 4013 usable projects and 17 variables for further analysis.
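As a minimal, self-contained sketch of the filtering pattern used above (toy data standing in for the real dataset; the rows are made up purely for illustration):

```r
library(dplyr)

# Toy rows mimicking the three filter conditions applied to the real data
toy <- tibble::tibble(
  recommended_for_research = c(TRUE, TRUE, FALSE),
  umbrella                 = c(0, 1, 0),
  flow_class               = c("ODA-like", "OOF-like", "Vague (Official Finance)")
)

kept <- toy %>%
  filter(recommended_for_research,                  # recommended projects only
         umbrella == 0,                             # not part of umbrella projects
         flow_class != "Vague (Official Finance)")  # distinguishable flow types
kept  # only the first toy row survives all three filters
```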
sum_data <- flow %>% summary()
sum_data
## year funding_agency implementing_agency recipient_condensed
## Min. :2000 Length:4013 Length:4013 Length:4013
## 1st Qu.:2006 Class :character Class :character Class :character
## Median :2009 Mode :character Mode :character Mode :character
## Mean :2008
## 3rd Qu.:2012
## Max. :2014
##
## recipient_region title description
## Length:4013 Length:4013 Length:4013
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## status flow_class usd_defl_2014
## Length:4013 Length:4013 Min. :1.760e+02
## Class :character Class :character 1st Qu.:1.018e+06
## Mode :character Mode :character Median :8.134e+06
## Mean :1.253e+08
## 3rd Qu.:5.650e+07
## Max. :2.036e+10
## NA's :1640
## usd_current crs_sector_name start_actual
## Min. :1.400e+02 Length:4013 Min. :1996-01-01 00:00:00
## 1st Qu.:6.041e+05 Class :character 1st Qu.:2006-09-15 12:00:00
## Median :5.000e+06 Mode :character Median :2009-04-20 00:00:00
## Mean :1.002e+08 Mean :2008-11-12 04:56:50
## 3rd Qu.:3.990e+07 3rd Qu.:2011-06-19 18:00:00
## Max. :1.500e+10 Max. :2016-03-23 00:00:00
## NA's :1640 NA's :3329
## start_planned end_actual
## Min. :2000-01-01 00:00:00 Min. :2000-01-01 00:00:00
## 1st Qu.:2006-07-16 12:00:00 1st Qu.:2006-08-17 00:00:00
## Median :2008-09-16 00:00:00 Median :2009-08-30 12:00:00
## Mean :2008-08-07 01:00:27 Mean :2009-02-06 06:14:19
## 3rd Qu.:2011-02-04 18:00:00 3rd Qu.:2011-10-30 12:00:00
## Max. :2015-11-23 00:00:00 Max. :2016-03-31 00:00:00
## NA's :3751 NA's :3159
## end_planned is_cofinanced
## Min. :2000-03-26 00:00:00 Mode :logical
## 1st Qu.:2008-06-23 12:00:00 FALSE:3835
## Median :2011-09-03 12:00:00 TRUE :178
## Mean :2011-06-16 02:22:56
## 3rd Qu.:2014-08-04 00:00:00
## Max. :2020-09-09 00:00:00
## NA's :3741
Total Number of Projects by Year
pro_num_year <- flow %>%
select(year, usd_defl_2014) %>%
group_by(year) %>%
count()
colnames(pro_num_year) = c("Year",
"Number of Projects")
pro_num_year
## # A tibble: 15 x 2
## # Groups: year [15]
## Year `Number of Projects`
## <dbl> <int>
## 1 2000 101
## 2 2001 135
## 3 2002 134
## 4 2003 181
## 5 2004 181
## 6 2005 242
## 7 2006 314
## 8 2007 327
## 9 2008 278
## 10 2009 383
## 11 2010 342
## 12 2011 390
## 13 2012 336
## 14 2013 342
## 15 2014 327
gg1 <- ggplot(flow) +
geom_histogram(mapping = aes(x = year),
binwidth = 0.5, color = "darkblue", fill = "lightblue") +
ggtitle("Number of Projects by Years") +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
xlab("Year") + ylab("Number of Projects")
gg1
The data and the graph show an overall increase in the number of projects over the years, with pullbacks in 2008 (around the global recession), 2010, and 2014.
Snapshot of project values (constant 2014 USD) by year, including the total, average, minimum, and maximum values
pro_value_year <- flow %>%
select(year, usd_defl_2014) %>%
group_by(year) %>%
summarize(sum = sum(usd_defl_2014, na.rm = T),
avg = mean(usd_defl_2014, na.rm = T),
min = min(usd_defl_2014, na.rm = T),
max = max(usd_defl_2014, na.rm = T)) %>%
arrange(year)
colnames(pro_value_year) = c("Year",
"Total Value of Project",
"Average Value of Project",
"Minimum Value of Project",
"Maximum Value of Project")
left_join(pro_num_year, pro_value_year, by = "Year")
## # A tibble: 15 x 6
## # Groups: Total Value of Project [15]
## Year `Number of Proj… `Total Value of… `Average Value …
## <dbl> <int> <dbl> <dbl>
## 1 2000 101 2337247008. 32461764.
## 2 2001 135 5455489633. 50050364.
## 3 2002 134 4376162094. 46064864.
## 4 2003 181 5525240301. 46824070.
## 5 2004 181 4520528562. 40004678.
## 6 2005 242 8045518295. 49663693.
## 7 2006 314 13831559135. 73965557.
## 8 2007 327 15695022812. 75821366.
## 9 2008 278 8530799541. 55756860.
## 10 2009 383 64117820118. 276369914.
## 11 2010 342 24826825668. 134928400.
## 12 2011 390 44008421953. 202803788.
## 13 2012 336 37135252439. 200731094.
## 14 2013 342 27958099383. 158852837.
## 15 2014 327 30997892020. 190171117.
## # … with 2 more variables: `Minimum Value of Project` <dbl>, `Maximum
## # Value of Project` <dbl>
yearSum <- flow %>%
group_by(year) %>%
summarize(sum = sum(usd_defl_2014, na.rm = TRUE))
ggplot(yearSum, mapping = aes(x = year, y = sum)) +
geom_bar(stat = "identity") +
ggtitle("Graph 1: Total Value of Projects by Years") +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
xlab("Year") + ylab("Total Value of Projects")
gg3 <- ggplot(flow,
mapping = aes(x = year, y = usd_defl_2014,
group = year, color = year)) +
geom_point(na.rm = TRUE) +
ggtitle("Graph 2: Project Values by Year") +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
xlab("Year") + ylab("Project Value (2014 USD)")
gg3
The data show an overall increase in the total value of projects over the years, with a few pullbacks along the way. Graph 1 gives the clearest picture of the overall pattern: the highest flow occurred in 2009, following a sharp drop in 2008 and itself followed by a sharp drop in 2010. Graph 2 shows that some years, such as 2009, contain outlier projects, which helps explain the 2009 peak in total project value.
Total number of projects by region
pro_num_region <- flow %>%
group_by(recipient_region) %>%
count() %>%
arrange(desc(n))
colnames(pro_num_region) = c("Region",
"Number of Projects")
pro_num_region
## # A tibble: 6 x 2
## # Groups: recipient_region [6]
## Region `Number of Projects`
## <chr> <int>
## 1 Africa 2168
## 2 Asia 1035
## 3 Latin America and the Caribbean 295
## 4 The Pacific 256
## 5 Central and Eastern Europe 166
## 6 Middle East 93
gg4 <- ggplot(flow) +
geom_bar(mapping = aes(x = recipient_region),
color = "darkblue", fill = "lightblue") +
ggtitle("Number of Projects by Regions") +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
xlab("Region") + ylab("Number of Projects") +
scale_x_discrete(labels = abbreviate)
gg4
The data show that Africa receives the largest number of projects. The graph shows that Africa has more than twice as many projects as Asia, the region with the second-most projects.
Total value of projects (2014 USD) by region
pro_value_region <- flow %>%
group_by(recipient_region) %>%
summarize(sum = sum(usd_defl_2014, na.rm = TRUE)) %>%
arrange(desc(sum))
colnames(pro_value_region) = c("Region",
"Total Value of Projects")
pro_value_region
## # A tibble: 6 x 2
## Region `Total Value of Projects`
## <chr> <dbl>
## 1 Asia 95822989635.
## 2 Africa 92713211229.
## 3 Central and Eastern Europe 55569462403.
## 4 Latin America and the Caribbean 50403066261.
## 5 The Pacific 2337839964.
## 6 Middle East 515309469.
regionSum <- flow %>% group_by(recipient_region) %>%
summarize(sum = sum(usd_defl_2014, na.rm = TRUE))
pie(regionSum$sum, labels = regionSum$recipient_region,
radius = 1, cex = 0.5, main = "Total Value of Projects by Regions")
library(plotly)
plot_ly(regionSum, labels = ~ recipient_region,
values = ~ sum, type = 'pie',
textposition = 'outside',textinfo = 'label') %>%
layout(title = 'Total Value of Projects by Regions',
xaxis = list(showgrid = FALSE, zeroline = FALSE,
showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE,
showticklabels = FALSE))
The data show that Asia receives the highest total value of projects. The pie chart makes it clear that Asia and Africa hold nearly equal and substantial shares of the total project value, while the Pacific and the Middle East hold the smallest shares.
Average project value (2014 USD) by region
avg_pro_value_region <- left_join(pro_num_region,
pro_value_region,
by = "Region") %>%
mutate(`Average Project Value` = `Total Value of Projects`/`Number of Projects`) %>%
arrange(desc(`Average Project Value`)) %>%
select(Region, `Average Project Value`)
avg_pro_value_region
## # A tibble: 6 x 3
## # Groups: Total Value of Projects [6]
## `Total Value of Project… Region `Average Project Val…
## <dbl> <chr> <dbl>
## 1 55569462403. Central and Eastern Europe 334755798.
## 2 50403066261. Latin America and the Car… 170857852.
## 3 95822989635. Asia 92582599.
## 4 92713211229. Africa 42764396.
## 5 2337839964. The Pacific 9132187.
## 6 515309469. Middle East 5540962.
Combining the number and the total value of projects by region, Central and Eastern Europe receives the highest average project value.
Total number of projects by country
pro_num_country <- flow %>%
group_by(recipient_condensed) %>%
count() %>%
arrange(desc(n)) %>%
head(10)
colnames(pro_num_country) = c("Country",
"Total Number of Projects")
pro_num_country
## # A tibble: 10 x 2
## # Groups: recipient_condensed [10]
## Country `Total Number of Projects`
## <chr> <int>
## 1 Cambodia 166
## 2 Zimbabwe 115
## 3 Pakistan 108
## 4 Angola 104
## 5 Tanzania 94
## 6 Ghana 86
## 7 Liberia 83
## 8 Uganda 80
## 9 Kenya 78
## 10 Sri Lanka 78
Cambodia ranks first in the number of projects carried out.
Total value of projects (2014 USD) by country
pro_value_country <- flow %>%
group_by(recipient_condensed) %>%
summarize(sum = sum(usd_defl_2014, na.rm = TRUE)) %>%
arrange(desc(sum)) %>%
head(10)
colnames(pro_value_country) = c("Country",
"Total Value of Projects")
pro_value_country
## # A tibble: 10 x 2
## Country `Total Value of Projects`
## <chr> <dbl>
## 1 Russia 36622577666.
## 2 Pakistan 18679406540.
## 3 Angola 14293955203.
## 4 Laos 11632573144.
## 5 Sri Lanka 11014082678.
## 6 Venezuela 10768596440
## 7 Turkmenistan 10676427454.
## 8 Ecuador 9698347826.
## 9 Brazil 8528111506.
## 10 Cambodia 7973834856.
In terms of total project value, Cambodia ranks only tenth, while Russia receives the most Chinese foreign investment in dollar terms.
Average project value (2014 USD) by country
pro_num_country <- flow %>%
group_by(recipient_condensed) %>%
count() %>%
arrange(desc(n))
colnames(pro_num_country) = c("Country",
"Total Number of Projects")
pro_value_country <- flow %>%
group_by(recipient_condensed) %>%
summarize(sum = sum(usd_defl_2014, na.rm = TRUE)) %>%
arrange(desc(sum))
colnames(pro_value_country) = c("Country",
"Total Value of Projects")
avg_pro_value_country <- left_join(pro_num_country,
pro_value_country,
by = "Country") %>%
mutate(`Average Project Value` = `Total Value of Projects`/`Total Number of Projects`) %>%
arrange(desc(`Average Project Value`)) %>%
select(Country, `Average Project Value`) %>%
head(10)
avg_pro_value_country
## # A tibble: 10 x 3
## # Groups: Total Value of Projects [10]
## `Total Value of Projects` Country `Average Project Value`
## <dbl> <chr> <dbl>
## 1 36622577666. Russia 2441505178.
## 2 6681307048. Cuba 1670326762.
## 3 10676427454. Turkmenistan 1334553432.
## 4 4637474150. Argentina 772912358.
## 5 8528111506. Brazil 568540767.
## 6 10768596440 Venezuela 468199845.
## 7 9698347826. Ecuador 461826087.
## 8 6735422689. Kazakhstan 420963918.
## 9 5569237484. India 397802677.
## 10 2878552493. Bahamas 319839166.
This table shows that Russia receives the highest average value per project, which likely explains why Central and Eastern Europe is the region with the highest average project value from China.
title <- tibble(txt = flow$title)
description <- tibble(txt = flow$description)
title <- title %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
title
## # A tibble: 4,839 x 2
## word n
## <chr> <int>
## 1 china 2785
## 2 million 982
## 3 donates 745
## 4 linked 581
## 5 usd 571
## 6 chinese 437
## 7 project 418
## 8 medical 365
## 9 grants 351
## 10 construction 349
## # … with 4,829 more rows
title %>%
with(wordcloud(word, n, max.words = 100,
colors = brewer.pal(8, "Dark2")))
Here is a snapshot of the topics mentioned in the titles of all projects. Keywords such as donates, medical, and construction are worth attention.
description <- description %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
description
## # A tibble: 19,002 x 2
## word n
## <chr> <int>
## 1 china 4929
## 2 chinese 3899
## 3 project 3350
## 4 million 3079
## 5 government 1759
## 6 signed 1702
## 7 loan 1660
## 8 construction 1649
## 9 usd 1483
## 10 agreement 1380
## # … with 18,992 more rows
description %>%
with(wordcloud(word, n, max.words = 100,
colors = brewer.pal(8, "Dark2")))
Here is a snapshot of the topics mentioned in the descriptions of all projects. Keywords such as construction, equipment, development, bank, medical, ambassador, and government are worth attention.
Is a project ODA or not?
ODA stands for Official Development Assistance, the major type of official development finance, as opposed to Other Official Flows (OOF). The Global Chinese Official Finance Dataset includes one dataset of China’s ODA-like projects and another of China’s OOF-like projects. Projects are labeled “ODA-like” and “OOF-like” because ODA and OOF are OECD definitions: ODA is traditionally referred to as aid, whereas OOF is simply investment. Since Chinese aid and investment do not follow this standard, the researchers at AidData had to classify projects themselves so that the data would be comparable to Western investment and aid. The dataset covers roughly 4,000-plus ODA-like projects and 1,100 OOF-like projects over the 2000-2014 period.
Rationale: Since the outcome variable in this predictive model is binary, this study runs the following classification models.
Steps: Each model follows four steps: 1) create new variables and split the data into training and testing sets; 2) optimize the model on the training data; 3) predict on the testing data; 4) evaluate the model by comparing training and testing performance.
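The four steps can be sketched end-to-end on a built-in dataset (this uses caret with iris reduced to two classes purely for illustration; it is not the AidData data or any of the final models below):

```r
library(caret)

# Step 1: build a binary outcome and split into training/testing sets
two_class <- droplevels(subset(iris, Species != "setosa"))
set.seed(1)
idx <- createDataPartition(two_class$Species, p = 0.8, list = FALSE)
tr <- two_class[idx, ]
te <- two_class[-idx, ]

# Step 2: optimize (here, simply fit) the model on the training data
fit <- train(Species ~ ., data = tr, method = "glm", family = binomial)

# Step 3: predict on the testing data
pred <- predict(fit, newdata = te)

# Step 4: evaluate via the confusion matrix
confusionMatrix(pred, reference = te$Species)
```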
Data Cleaning
flow_type_dummy <- ifelse(flow$flow_class == "ODA-like", 1, 0)
is_cofinanced_dummy <- ifelse(flow$is_cofinanced == TRUE, 1, 0)
new_flow <- cbind(flow, flow_type_dummy,
is_cofinanced_dummy) %>%
data.frame()
new_flow <- new_flow %>%
filter(!is.na(usd_defl_2014),
!is.na(usd_current))
new_flow$flow_type_dummy <- factor(new_flow$flow_type_dummy,
labels = c("isODA", "notODA"),
levels = 1:0)
set.seed(12345)
in_train <- createDataPartition(y = new_flow$flow_type_dummy,
p = 0.8, list = FALSE)
training <- new_flow[ in_train, ]
testing <- new_flow[-in_train, ]
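The levels = 1:0 / labels mapping in the factor conversion above is easy to misread; a tiny self-contained check (toy vector, not the real dummy):

```r
# 1 maps to "isODA" and 0 maps to "notODA", matching flow_type_dummy above
d <- c(1, 0, 1)
f <- factor(d, levels = 1:0, labels = c("isODA", "notODA"))
f           # isODA notODA isODA
levels(f)   # "isODA" "notODA" -- "isODA" is the first (positive) level
```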
Logit Model
logit <- glm(flow_type_dummy == "isODA" ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, family = binomial(link = "logit"))
y_hat_logit <- predict(logit, newdata = testing, type = "response")
z_logit <- factor(y_hat_logit > 0.5,
levels = c(TRUE, FALSE),
labels = c("isODA", "notODA"))
confusionMatrix(z_logit, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 365 57
## notODA 10 42
##
## Accuracy : 0.8586
## 95% CI : (0.824, 0.8887)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0001027
##
## Kappa : 0.4817
##
## Mcnemar's Test P-Value : 1.912e-08
##
## Sensitivity : 0.9733
## Specificity : 0.4242
## Pos Pred Value : 0.8649
## Neg Pred Value : 0.8077
## Prevalence : 0.7911
## Detection Rate : 0.7700
## Detection Prevalence : 0.8903
## Balanced Accuracy : 0.6988
##
## 'Positive' Class : isODA
##
Penalized Logit Model
ctrl_logit <- trainControl(method = "repeatedcv", repeats = 3,
classProbs = TRUE, summaryFunction = twoClassSummary)
Grid_logit <- expand.grid(.lambda = seq(0, 1, length = 10),
.alpha = seq(0, 1, length = 10))
penalized_logit <- train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "glmnet",
trControl = ctrl_logit, tuneGrid = Grid_logit,
metric = "ROC", preProcess = c("center", "scale"))
z_penalized_logit <- predict(penalized_logit, newdata = testing)
confusionMatrix(z_penalized_logit, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 365 56
## notODA 10 43
##
## Accuracy : 0.8608
## 95% CI : (0.8263, 0.8906)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 6.231e-05
##
## Kappa : 0.4918
##
## Mcnemar's Test P-Value : 3.040e-08
##
## Sensitivity : 0.9733
## Specificity : 0.4343
## Pos Pred Value : 0.8670
## Neg Pred Value : 0.8113
## Prevalence : 0.7911
## Detection Rate : 0.7700
## Detection Prevalence : 0.8882
## Balanced Accuracy : 0.7038
##
## 'Positive' Class : isODA
##
Linear Probability Model (LPM)
ols <- lm(flow_type_dummy == "isODA" ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training)
y_hat_ols <- predict(ols, newdata = testing)
z_ols <- factor(y_hat_ols > 0.5,
levels = c(TRUE, FALSE),
labels = c("isODA", "notODA"))
confusionMatrix(z_ols, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 366 62
## notODA 9 37
##
## Accuracy : 0.8502
## 95% CI : (0.8149, 0.8811)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0006452
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 6.775e-10
##
## Sensitivity : 0.9760
## Specificity : 0.3737
## Pos Pred Value : 0.8551
## Neg Pred Value : 0.8043
## Prevalence : 0.7911
## Detection Rate : 0.7722
## Detection Prevalence : 0.9030
## Balanced Accuracy : 0.6749
##
## 'Positive' Class : isODA
##
Robust LPM
robust <- lmrob(flow_type_dummy == "isODA" ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training)
y_hat_robust <- predict(robust, newdata = testing)
z_robust <- factor(y_hat_robust > 0.5,
levels = c(TRUE, FALSE),
labels = c("isODA", "notODA"))
confusionMatrix(z_robust, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 366 62
## notODA 9 37
##
## Accuracy : 0.8502
## 95% CI : (0.8149, 0.8811)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0006452
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 6.775e-10
##
## Sensitivity : 0.9760
## Specificity : 0.3737
## Pos Pred Value : 0.8551
## Neg Pred Value : 0.8043
## Prevalence : 0.7911
## Detection Rate : 0.7722
## Detection Prevalence : 0.9030
## Balanced Accuracy : 0.6749
##
## 'Positive' Class : isODA
##
LDA
LDA <- train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "lda",
preProcess = c("center", "scale"))
z_LDA <- predict(LDA, newdata = testing)
confusionMatrix(z_LDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 354 54
## notODA 21 45
##
## Accuracy : 0.8418
## 95% CI : (0.8057, 0.8735)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0031562
##
## Kappa : 0.4543
##
## Mcnemar's Test P-Value : 0.0002199
##
## Sensitivity : 0.9440
## Specificity : 0.4545
## Pos Pred Value : 0.8676
## Neg Pred Value : 0.6818
## Prevalence : 0.7911
## Detection Rate : 0.7468
## Detection Prevalence : 0.8608
## Balanced Accuracy : 0.6993
##
## 'Positive' Class : isODA
##
QDA
QDA <- train(flow_type_dummy ~
usd_defl_2014 + usd_current,
data = training, method = "qda",
preProcess = c("center", "scale"))
z_QDA <- predict(QDA, newdata = testing)
confusionMatrix(z_QDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 369 78
## notODA 6 21
##
## Accuracy : 0.8228
## 95% CI : (0.7854, 0.8561)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0486
##
## Kappa : 0.2678
##
## Mcnemar's Test P-Value : 9.429e-15
##
## Sensitivity : 0.9840
## Specificity : 0.2121
## Pos Pred Value : 0.8255
## Neg Pred Value : 0.7778
## Prevalence : 0.7911
## Detection Rate : 0.7785
## Detection Prevalence : 0.9430
## Balanced Accuracy : 0.5981
##
## 'Positive' Class : isODA
##
PLSDA
ctrl_PLSDA <- trainControl(method = "repeatedcv", repeats = 2,
classProbs = TRUE, summaryFunction = twoClassSummary)
PLSDA_grid <- expand.grid(.ncomp = 1:2)
PLSDA <- train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "pls",
preProcess = c("center", "scale"), metric = "ROC",
trControl = ctrl_PLSDA, tuneGrid = PLSDA_grid)
z_PLSDA <- predict(PLSDA, newdata = testing)
confusionMatrix(z_PLSDA, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 372 72
## notODA 3 27
##
## Accuracy : 0.8418
## 95% CI : (0.8057, 0.8735)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.003156
##
## Kappa : 0.356
##
## Mcnemar's Test P-Value : 4.096e-15
##
## Sensitivity : 0.9920
## Specificity : 0.2727
## Pos Pred Value : 0.8378
## Neg Pred Value : 0.9000
## Prevalence : 0.7911
## Detection Rate : 0.7848
## Detection Prevalence : 0.9367
## Balanced Accuracy : 0.6324
##
## 'Positive' Class : isODA
##
Nearest Shrunken Centroids
ctrl_NSC <- trainControl(method = "repeatedcv", repeats = 3,
classProbs = TRUE, summaryFunction = twoClassSummary)
grid_NSC <- data.frame(.threshold = 0:25)
NSC <- train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "pam",
preProcess = c("center", "scale"), metric = "ROC",
trControl = ctrl_NSC, tuneGrid = grid_NSC)
z_NSC <- predict(NSC, newdata = testing)
confusionMatrix(z_NSC, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 375 98
## notODA 0 1
##
## Accuracy : 0.7932
## 95% CI : (0.754, 0.8288)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.4818
##
## Kappa : 0.0159
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0101
## Pos Pred Value : 0.7928
## Neg Pred Value : 1.0000
## Prevalence : 0.7911
## Detection Rate : 0.7911
## Detection Prevalence : 0.9979
## Balanced Accuracy : 0.5051
##
## 'Positive' Class : isODA
##
Boosting
gbmControl = trainControl(method="cv",
number=5, returnResamp = "all")
gbm = train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data=training, method="gbm",
distribution="bernoulli",
trControl=gbmControl, verbose=F,
tuneGrid=expand.grid(.n.trees=seq(100, 1000,
by = 100),
.shrinkage=c(0.01, 0.1),
.interaction.depth=1:2,
.n.minobsinnode=1:5),
na.action = na.omit)
y_hat_gbm <- predict(gbm, newdata = testing, na.action = na.pass)
confusionMatrix(y_hat_gbm, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 354 47
## notODA 21 52
##
## Accuracy : 0.8565
## 95% CI : (0.8217, 0.8868)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0001666
##
## Kappa : 0.5195
##
## Mcnemar's Test P-Value : 0.0024318
##
## Sensitivity : 0.9440
## Specificity : 0.5253
## Pos Pred Value : 0.8828
## Neg Pred Value : 0.7123
## Prevalence : 0.7911
## Detection Rate : 0.7468
## Detection Prevalence : 0.8460
## Balanced Accuracy : 0.7346
##
## 'Positive' Class : isODA
##
Random Forest
rf<- randomForest(flow_type_dummy ~
usd_defl_2014 + usd_current +
is_cofinanced_dummy,
data=training,
importance = TRUE,
na.action = na.omit)
y_hat_rf <- predict(rf, newdata = testing,
type = "response", na.action = na.pass)
confusionMatrix(y_hat_rf, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 349 51
## notODA 26 48
##
## Accuracy : 0.8376
## 95% CI : (0.8012, 0.8696)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.006386
##
## Kappa : 0.4581
##
## Mcnemar's Test P-Value : 0.006237
##
## Sensitivity : 0.9307
## Specificity : 0.4848
## Pos Pred Value : 0.8725
## Neg Pred Value : 0.6486
## Prevalence : 0.7911
## Detection Rate : 0.7363
## Detection Prevalence : 0.8439
## Balanced Accuracy : 0.7078
##
## 'Positive' Class : isODA
##
Neural Network Model
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10))
ctrl_nn <- trainControl(method = "cv", number = 10)
nn <- train(flow_type_dummy ~
usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "nnet",
trControl = ctrl_nn, tuneGrid = nnetGrid,
preProcess = c("center", "scale"), trace = FALSE)
y_hat_nn <- predict(nn, newdata = testing)
confusionMatrix(y_hat_nn, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 353 48
## notODA 22 51
##
## Accuracy : 0.8523
## 95% CI : (0.8171, 0.883)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.0004175
##
## Kappa : 0.5053
##
## Mcnemar's Test P-Value : 0.0028074
##
## Sensitivity : 0.9413
## Specificity : 0.5152
## Pos Pred Value : 0.8803
## Neg Pred Value : 0.6986
## Prevalence : 0.7911
## Detection Rate : 0.7447
## Detection Prevalence : 0.8460
## Balanced Accuracy : 0.7282
##
## 'Positive' Class : isODA
##
MARS model
ctrl_mars <- trainControl(method = "cv", number = 10)
marsGrid <- expand.grid(.degree = 1:3, .nprune = 1:10)
mars <- train(flow_type_dummy ~ usd_defl_2014 + usd_current +
as.factor(recipient_region) + as.factor(year) +
as.factor(crs_sector_name) + is_cofinanced_dummy,
data = training, method = "earth",
trControl = ctrl_mars, tuneGrid = marsGrid)
y_hat_mars <- predict(mars, newdata = testing)
confusionMatrix(y_hat_mars, reference = testing$flow_type_dummy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction isODA notODA
## isODA 356 59
## notODA 19 40
##
## Accuracy : 0.8354
## 95% CI : (0.7989, 0.8677)
## No Information Rate : 0.7911
## P-Value [Acc > NIR] : 0.00889
##
## Kappa : 0.4151
##
## Mcnemar's Test P-Value : 1.006e-05
##
## Sensitivity : 0.9493
## Specificity : 0.4040
## Pos Pred Value : 0.8578
## Neg Pred Value : 0.6780
## Prevalence : 0.7911
## Detection Rate : 0.7511
## Detection Prevalence : 0.8755
## Balanced Accuracy : 0.6767
##
## 'Positive' Class : isODA
##
Ranked by overall prediction accuracy, the penalized logit model (0.8608) performs best at predicting whether a project is ODA or not, followed closely by the logit (0.8586) and boosting (0.8565) models. Notably, the boosting model also achieves the highest balanced accuracy (0.7346) of all the models. With a relatively high level of prediction accuracy, this study can help categorize Chinese foreign investment going forward. Because China does not classify its foreign investment by the OECD standard, it is hard to compare its investment efficiency directly with the aid efficiency of Western countries or of the World Bank. This study therefore helps, to a certain degree, to address that problem and provides a benchmark for researchers to compare Chinese and Western foreign investment within the right categories.
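To make the comparison concrete, the overall and balanced accuracies from the confusion matrices above can be collected into one table and ranked (values transcribed from the printed output, not recomputed):

```r
library(dplyr)

# Accuracies transcribed from the confusion matrices printed above
results <- tibble::tibble(
  model             = c("Logit", "Penalized logit", "LPM", "Robust LPM",
                        "LDA", "QDA", "PLSDA", "NSC", "Boosting",
                        "Random forest", "Neural network", "MARS"),
  accuracy          = c(0.8586, 0.8608, 0.8502, 0.8502, 0.8418, 0.8228,
                        0.8418, 0.7932, 0.8565, 0.8376, 0.8523, 0.8354),
  balanced_accuracy = c(0.6988, 0.7038, 0.6749, 0.6749, 0.6993, 0.5981,
                        0.6324, 0.5051, 0.7346, 0.7078, 0.7282, 0.6767)
)

results %>% arrange(desc(accuracy))  # penalized logit leads on overall accuracy
```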